Lecture’s Plan

  1. What is text clustering?
  2. What are the applications?
  3. How to cluster text data?

Unsupervised learning

Clustering v.s. Classification

Clustering

  • Discover “natural structure” of data
    • What is the criterion?
    • How to identify them?
    • How to evaluate the results?

Clustering

  • Clustering - the process of grouping a set of objects into clusters of similar objects
    • Basic criteria
      • high intra-cluster similarity
      • low inter-cluster similarity
    • No (little) supervision signal about the underlying clustering structure
    • Need similarity/distance as guidance to form clusters

Applications of text clustering

  • Organize document collections
    • Automatically identify hierarchical/topical relation among documents

Applications of text clustering

  • Grouping search results
    • Organize documents by topics
    • Facilitate user browsing

Applications of text clustering

  • Topic modeling
    • Grouping words into topics

Distance metric

Distance metric

  • Basic properties

    • Positive separation \[𝐷(x,y)>0, \forall x \neq y\] \[𝐷(x,y)=0, \mathrm{i.f.f.}, x=y\]

    • Symmetry \[𝐷(x,y)=𝐷(y,x)\]

    • Triangle inequality \[𝐷(x,y)≤𝐷(x,z)+𝐷(z,y)\]

Typical distance metric

  • Minkowski metric

    • \(d(x,y) = \sqrt[p]{\sum^V_{i=1}{(x_i-y_i)^p}}\)

      • When \(p=2\), it is Euclidean distance
  • Cosine metric

    • \(𝑑(x,y)=1−cosine(x,y)\)

      • when \(|x|^2=|y|^2=1\), \(1−cosine(x,y)=\frac{r^2}{2}\)

Typical distance metric

  • Edit distance

    • Count the minimum number of operations required to transform one string into the other

      • Possible operations: insertion, deletion and replacement





← Can be efficiently solved by dynamic programming

Typical distance metric

  • Edit distance

    • Count the minimum number of operations required to transform one string into the other

      • Possible operations: insertion, deletion and replacement
    • Extent to distance between sentences

      • Word similarity as cost of replacement

        • “terrible” -> “bad”: low cost → Lexicon or distributional semantics

        • “terrible” -> “terrific”: high cost → Lexicon or distributional semantics

      • Preserving word order in distance computation

Clustering algorithms

Clustering algorithms

  1. Partitional clustering algorithms
    • Partition the instances into different groups
    • Flat structure
      • Need to specify the number of classes in advance

Clustering algorithms

  1. Hierarchical clustering algorithms
    • Create a hierarchical decomposition of objects
    • Rich internal structure
      • No need to specify the number of clusters
      • Can be used to organize objects

Clustering algorithms

  1. Topic modeling
    • Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts.
    • We want to find themes (or topics) in documents
    • We don’t want to do supervised topic classification – rather not fix topics in advance nor do manual annotation
    • Need an approach which automatically teases out the topics
    • This is essentially a clustering problem - can think of both words and documents as being clustered

Hard vs. soft clustering

  • Hard clustering: Each document belongs to exactly one cluster
    • More common and easier to do
  • Soft clustering: A document can belong to more than one cluster.
    • Makes more sense for applications like creating browsable hierarchies
    • You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
    • You can only do that with a soft clustering approach.

Partitioning clustering

Partitioning Algorithms

  • Partitioning method: Construct a partition of \(n\) documents into a set of \(K\) clusters

  • Given: a set of documents and the number \(K\)

  • Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion

    • Globally optimal
      • Intractable for many objective functions
      • Ergo, exhaustively enumerate all partitions
    • Effective heuristic methods: K-means and K-medoids algorithms

Partitioning Algorithms

  • Typical partitional clustering algorithms

    • k-means clustering

      • Partition data by its closest mean

Partitioning Algorithms

  • Typical partitional clustering algorithms
    • k-means clustering
      • Partition data by its closest mean
    • Gaussian Mixture Model
      • Consider variance within the cluster as well

K-Means

  • Assumes documents are real-valued vectors.

  • Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, \(c\):

\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec a \in c}{\vec x}\]

  • Reassignment of instances to clusters is based on distance to the current cluster centroids.

    • (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm

  • Select \(K\) random docs \(\{s_1, s_2,… s_K\}\) as seeds.

  • Until clustering converges (or other stopping criterion):

    • For each doc \(d_i\):

      • Assign \(d_i\) to the cluster \(c_j\) such that \(dist(x_i, s_j)\) is minimal.
    • (Next, update the seeds to the centroid of each cluster)

    • For each cluster \(c_j\)

      • \(s_j = \mu(c_j)\)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

K Means Example (K=2)

Termination conditions

  • Several possibilities, e.g.,

    • A fixed number of iterations.

    • Doc partition unchanged.

    • Centroid positions don’t change.

Convergence

  • Why should the K-means algorithm ever reach a fixed point?

    • A state in which clusters don’t change.
  • K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.

    • EM is known to converge.

    • Number of iterations could be large.

      • But in practice usually isn’t

Seed Choice

  • Results can vary based on random seed selection.

  • Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.

    • Select good seeds using a heuristic (e.g., doc least similar to any existing mean)

    • Try out multiple starting points

    • Initialize with the results of another method.

K-means issues, variations, etc.

  • Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means

  • Assumes clusters are spherical in vector space

    • Sensitive to coordinate changes, weighting etc.
  • Disjoint and exhaustive

    • Doesn’t have a notion of “outliers” by default

    • But can add outlier filtering

MBK

DBScan

Hierarchical clustering

Dendrogram: Hierarchical Clustering

  • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

  • One approach: recursive application of a partitional clustering algorithm.

Dendrogram: Hierarchical Clustering

  • Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Clustering algorithms

  • Typical hierarchical clustering algorithms

    • Bottom-up agglomerative clustering

      • Start with individual objects as separated clusters
      • Repeatedly merge closest pair of clusters

Clustering algorithms

  • Typical hierarchical clustering algorithms

    • Top-down divisive clustering

      • Start with all data as one cluster

      • Repeatedly splitting the remaining clusters into two

Hierarchical Agglomerative Clustering (HAC)

  • Starts with each doc in a separate cluster

    • then repeatedly joins the closest pair of clusters, until there is only one cluster.
  • The history of merging forms a binary tree or hierarchy.

Closest pair of clusters

  • Many variants to defining closest pair of clusters

    • Single-link
      • Similarity of the most cosine-similar (single-link)
    • Complete-link
      • Similarity of the “furthest” points, the least cosine-similar
    • Centroid
      • Clusters whose centroids (centers of gravity) are the most cosine-similar
    • Average-link
      • Average cosine between pairs of elements

Single Link Agglomerative Clustering

Complete Link

Matrix Factorization

Dimensionality Reduction and Matrix Factorization

\[D \approx UV^T\]

Maximize similarity between entries of \(D\) and \(UV^T\)

      subject to:

      Constraints on \(U\) and \(V\)

Singular Value Decomposition

Example SVD

lexicon: lion, tiger, cheetah, jaguar, porsche, ferrari

\[ D = \begin{pmatrix} & \text{lion} & \text{tiger} & \text{cheetah} & \text{jaguar} & \text{porsche} & \text{ferrari} \\ \text{Document-1} & 2 & 2 & 1 & 2 & 0 & 0 \\ \text{Document-2} & 2 & 3 & 3 & 3 & 0 & 0 \\ \text{Document-3} & 1 & 1 & 1 & 1 & 0 & 0 \\ \text{Document-4} & 2 & 2 & 2 & 3 & 1 & 1 \\ \text{Document-5} & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{Document-6} & 0 & 0 & 0 & 2 & 1 & 2 \end{pmatrix} \]

Example SVD

\[ \begin{align} D &\approx Q\Sigma P^T \\ &\approx \begin{pmatrix} -0.41 & 0.17 \\ -0.65 & 0.31 \\ -0.23 & 0.13 \\ -0.56 & -0.20 \\ -0.10 & -0.46 \\ -0.19 & -0.78 \end{pmatrix} \begin{pmatrix} 8.4 & 0 \\ 0 & 3.3 \end{pmatrix} \begin{pmatrix} -0.41 & -0.49 & -0.44 & -0.61 & -0.10 & -0.12 \\ 0.21 & 0.31 & 0.26 & -0.37 & -0.44 & -0.68 \end{pmatrix} \\ &= \begin{pmatrix} 1.55 & 1.87 & 1.67 & 1.91 & 0.10 & 0.04 \\ 2.46 & 2.98 & 2.66 & 2.95 & 0.10 & -0.03 \\ 0.89 & 1.08 & 0.96 & 1.04 & 0.01 & -0.04 \\ 1.81 & 2.11 & 1.91 & 3.14 & 0.77 & 1.03 \\ 0.02 & -0.05 & -0.02 & 1.06 & 0.74 & 1.11 \\ 0.10 & -0.02 & 0.04 & 1.89 & 1.28 & 1.92 \end{pmatrix} \end{align} \]

Advantages and Disadvantages of SVD/LSA

  1. The orthogonal basis representation of SVD is useful for folding in the reduced representation of new documents not included in the data matrix \(D\). For example, if \(\vec X\) is a row vector of a new document, then its reduced representation is givenn by the k-dimensional vector \(\vec XV\). This type of out-of-sample embedding is harder (albeit possible) with other forms of matrix factorization.

  2. The SVD solution provides the same error as unconstrained matrix factorization problem. Since most other forms of dimensionality reductiona are constrained matrix factorization problems, one can typically achieve a lower residual error with SVD at the same value of the rank \(k\).

Advantages and Disadvantages of SVD/LSA

  1. The topics of a text collection are often highly overlapping in terms of their vocabulary. As a result, the directions represented by the various topics are naturally not orthogonal, which matches poorly with orthogonal basis vectors. SVD does a poor job at revealing the actual semantic topics (or clusters) in the underlying data. Most forms of nonnegative matrix factorization that do not use orthogonal basis vectors are more adept at representing the clustering structure in the underlying data.
  2. The representation provided by SVD is not very interpretable and it is hard to match with the semantic concepts in the collection. A key part of the problem is that the eigenvectors contain both positive and negative components that are hard to interpret.

Nonnegative Matrix Factorization

lexicon: lion, tiger, cheetah, jaguar, porsche, ferrari  

Cats: lion, tiger, cheetah, jaguar

Cars: japuar, porche, ferrari

\[ D = \begin{pmatrix} & \text{lion} & \text{tiger} & \text{cheetah} & \text{jaguar} & \text{porsche} & \text{ferrari} \\ \text{Document-1} & 2 & 2 & 1 & 2 & 0 & 0 \\ \text{Document-2} & 2 & 3 & 3 & 3 & 0 & 0 \\ \text{Document-3} & 1 & 1 & 1 & 1 & 0 & 0 \\ \text{Document-4} & 2 & 2 & 2 & 3 & 1 & 1 \\ \text{Document-5} & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{Document-6} & 0 & 0 & 0 & 2 & 1 & 2 \end{pmatrix} \]

Nonnegative Matrix Factorization

Nonnegative Matrix Factorization

Advantages and Disadvantages of Nonnegative Matrix Factorization

  1. Nonnegativity enables a highly interpretable decomposition because of ability to represent the factorization as a sum of parts.
  2. The semantic clusters (or topics) are often captured more accurately by allowing non-orthogonality in the basis vectors. This is because semantic topics are often related.
  3. Nonnegative matrix factorization can better address polysemy than SVD.
  4. One disadvantage of nonnegative matrix factorization is that it is harder (than SVD) to compute the reduced representations of documents that were not included in the original data matrix \(D\). SVD is able to fold in such documents more easily as a simple projection because of its orthogonal basis system.

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis

\[ \begin{align} \text{Maximize}_{(P,Q,\Sigma)}&[\text{Log likelihood of generating }D \text{ using parameters in matrices }(P,Q,\Sigma)] \\ &= log(\prod_{i,j}{P(\text{Adding one occurnce of term }j \text{ in document }i)^{D_{ij}}}) \\ &= \sum_{i=1}^n\sum_{j=1}^d{D_{ij}} \begin{matrix} \underbrace{ log(P(\vec X_i, t_j)) } \\ \text{Parametrized by }P,Q,\Sigma \end{matrix} \\ &\text{subject to:} \\ &P,Q,\Sigma \geq 0 \\ &\text{Entries in each column of } P \text{ sum to 1} \\ &\text{Entries in each column of } Q \text{ sum to 1} \\ &\Sigma \text{ is a diagonal matrix that sums to 1} \end{align} \]

Comparison with SVD

Example of PLSA

Topic Modeling

Topic models

  • Three concepts: words, topics, and documents
  • Documents are a collection of words and have a probability distribution over topics
  • Topics have a probability distribution over words
  • Model:
    • Topics made up of words used to generate documents

Topic models (lda or gensim)

Reality: Documents observed, infer topics

Probabilistic modeling

  1. Treat data as observations that arise from a generative probabilistic process that includes hidden variables: For documents, the hidden variables reflect the thematic structure of the collection.

  2. Infer the hidden structure using posterior inference: What are the topics that describe this collection?

  3. Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?

Intro to Latent Dirichlet Allocation (LDA)

Cluster Validation

Desirable properties of clustering algorithms

  • Scalability

    • Both in time and space
  • Ability to deal with various types of data

    • No/less assumption about input data

    • Minimal requirement about domain knowledge

  • Interpretability and usability

Cluster validation

  • Criteria to determine whether the clusters are meaningful

    • Internal validation

      • Stability and coherence
    • External validation

      • Match with known categories

Internal validation

  • Coherence

    • Inter-cluster similarity v.s. intra-cluster similarity

    • Davies–Bouldin index

      • \(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) Evaluate every pair of clusters

        • where \(k\) is total number of clusters, \(\sigma_i\) is average distance of all elements in cluster \(i\), \(d(c_i, c_j)\) is the distance between cluster centroid \(c_i\) and \(c_j\).

We prefer smaller DB-index!

Internal validation

  • Coherence

    • Inter-cluster similarity v.s. intra-cluster similarity

    • Davies–Bouldin index

      • \(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) Evaluate every pair of clusters

        • where \(k\) is total number of clusters, \(\sigma_i\) is average distance of all elements in cluster \(i\), \(d(c_i, c_j)\) is the distance between cluster centroid \(c_i\) and \(c_j\).

We prefer smaller DB-index!

External validation

 

  • Given class label \(\Omega\) ( Required, might need extra cost) on each instance

    • Purity: correctly clustered documents in each cluster

      • \(purity(\Omega,C) = \frac{1}{N}\sum_{i=1}^k{\underset{j}{\operatorname{max}}|c_i \cap w_j|}\) ← Not a good metric if we assign each document into a single cluster
        • where \(c_i\) is a set of documents in cluster \(i\), and \(w_j\) is a set of documents in class \(j\)

\[purity(\Omega, C) = \\ \frac{1}{17}(5 + 4 + 3)\]

External validation

  • Given class label Ω on each instance

    • Normalized mutual information (NMI)

      • \(NMI(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega)+H(C)]/2}\) ↙ Normalization by entropy will penalize too many clusters
        • where \(I(\Omega, C) = \sum_i \sum_j P(w_i \cap c_j)log{\frac{P(w_i \cap c_j)}{P(w_i)P(c_j)}} \\ H(\Omega) = - \sum_i{P(w_i) logP(w_i)} \ \ \ \mathrm{and} \ \ \ H(C) = - \sum_j{P(c_j)logP(c_j)}\)
      • Indicate the increase of knowledge about classes when we know the clustering results

External validation

  • Given class label \(\Omega\) on each instance

    • Rand index

      • Idea: we want to assign two documents to the same cluster if and only if they are from the same class

      • \(RI = \frac{TP + TN}{TP + FP + FN + TN}\) ← Essentially it is like classification accuracy

External validation

  • Given class label \(\Omega\) on each instance
    • Rand index

External validation

  • Given class label \(\Omega\) on each instance

    • Precision/Recall/F-measure

      • Based on the contingency table, we can also define precision/recall/F-measure of clustering quality

Group Average

  • Similarity of two clusters = average similarity of all pairs within merged cluster.

\[sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|(|c_i \cup c_j| - 1)}\sum_{\vec x \in (c_i \cup c_j)} \sum_{\vec y \in (c_i \cup c_j): \vec y \neq \vec x}{sim(\vec x, \vec y)}\]

  • Compromise between single and complete link.

  • Two options:

    • Averaged across all ordered pairs in the merged cluster

    • Averaged over all pairs between the two original clusters

  • No clear difference in efficacy

Computing Group Average Similarity

  • Always maintain sum of vectors in each cluster.

\[\vec s(c_j) = \sum_{\vec x \in c_j}{\vec x}\] - Compute similarity of clusters in constant time:

\[sim(c_i, c_j) = \frac{\vec s(c_i) + \vec s(c_j) \cdot (\vec s(c_j) + \vec s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}\]

What Is A Good Clustering?

  • Internal criterion: A good clustering will produce high quality clusters in which:

    • the intra-class (that is, intra-cluster) similarity is high

    • the inter-class similarity is low

    • The measured quality of a clustering depends on both the document representation and the similarity measure used

External criteria for clustering quality

  • Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data

  • Assesses a clustering with respect to ground truth … requires labeled data

  • Assume documents with \(C\) gold standard classes, while our clustering algorithms produce \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\) with \(n_i\) members.

External Evaluation of Cluster Quality

  • Simple measure: purity, the ratio between the dominant class in the cluster \(\pi_i\) and the size of cluster \(\omega_i\)

\[Purity(\omega_i) = \frac{1}{n_i}\max_j(n_{ij}) \ \ \ j \in C\]

  • Biased because having n clusters maximizes purity

  • Others are entropy of classes in clusters (or mutual information between classes and clusters)

Rand index and Cluster F-measure

\[RI = \frac{A + D}{A + B + C + D}\]
Compare with standard Precision and Recall:
\[P= \frac{A}{A + B} \ \ \ \ \ \ P = \frac{A}{A + C}\]
People also define and use a cluster F-measure, which is probably a better measure

Summary

Summary: what did we learn?

  • Text clustering

  • In clustering, clusters are inferred from the data without human input (unsupervised learning)

  • However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents

  • Evaluation

Time for Practical 4!